The impact of running headers and footers on proximity searching

نویسندگان

  • Kazem Taghva
  • Julie Borsack
  • Thomas A. Nartker
  • Jeffrey S. Coombs
  • Ron Young
چکیده

Hundreds of experiments over the last decade on the retrieval of OCR documents performed by the Information Science Research Institute have shown that OCR errors do not significantly affect retrievability. We extend those results to show that in the case of proximity searching, the removal of running headers and footers from OCR text will not improve retrievability for such searches.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Header and footer extraction by page association

This paper introduces a robust algorithm to extract headers and footers from a variety of electronic documents such as image files, Adobe PDF files, and files generated OCR. Compared with the conventional methods based on the page-level layout and format, the proposed strategy considers a page in the context of neighboring pages. Through such page-association, the headers and footers on a varie...

متن کامل

Fast in-Place File Carving for Digital Forensics

Scalpel, a popular open source file recovery tool, performs file carving using the Boyer-Moore string search algorithm to locate headers and footers in a disk image. We show that the time required for file carving may be reduced significantly by employing multi-pattern search algorithms such as the multipattern Boyer-Moore and Aho-Corasick algorithms as well as asynchronous disk reads and multi...

متن کامل

Chapter 5 CONTEXT - BASED FILE BLOCK CLASSIFICATION

Because files are typically stored as sequences of data blocks, the file carving process in digital forensics involves the identification and collocation of the original blocks of files. Current file carving techniques that use the signatures of file headers and footers could be improved by first classifying each data block in the storage media as belonging to a given file type. Unfortunately, ...

متن کامل

FIASCO: Filtering the Internet by Automatic Subtree Classification, Osnabrück

The FIASCO system implements a machine-learning approach for the automatic removal of boilerplate (navigation bars, link lists, page headers and footers, etc.) from Web pages in order to make them available as a clean and useful corpus for linguistic purposes. The system parses an HTML document into a DOM tree representation and identifies a set of disjoint subtrees that correspond to text bloc...

متن کامل

Extracting the Main Content from HTML Documents

A modern web document typically consists of many kinds of information. Besides the main content which conveys the primary information, a web document also contains noisy contents such as advertisements, headers, footers, decorations, copyright information, navigation menus etc. The presence of noisy contents may affect the performance of applications such as commercial search engines, web crawl...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004